Statistical Stemming of Morphologically Rich Languages
Authors
Abstract
We analyze current machine translation into Russian, a morphologically rich language, and present a technique for unsupervised statistical stemming. An initial pass is based on intuitions and uses a CUDA kernel. A later pass runs expectation-maximization (EM). Because our model is relatively simple, the EM probabilities factorize into components that can be solved independently. Although our results were not exactly grammatical, they appear to cluster grammatical classes, occasionally with some amount of lexicalization (extra characters preceding the word).
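The abstract does not spell out the model, so the following is only a minimal sketch of how an EM pass with independently solvable components might look. It assumes a hypothetical two-factor model in which each word is a stem followed by a (possibly empty) suffix, with P(word) summed over all split points of P(stem) · P(suffix). The function names `splits`, `em_stem`, and `best_split` and the toy word list are illustrative assumptions, not the authors' implementation, and the CUDA-based initial pass is not reproduced here.

```python
from collections import defaultdict


def splits(word):
    """All (stem, suffix) splits with a non-empty stem; the suffix may be empty."""
    return [(word[:i], word[i:]) for i in range(1, len(word) + 1)]


def em_stem(words, iterations=20):
    """Run EM on an assumed stem+suffix model; returns (p_stem, p_suffix)."""
    # Uniform initialisation over every stem and suffix seen in any split.
    stems = {s for w in words for s, _ in splits(w)}
    suffixes = {x for w in words for _, x in splits(w)}
    p_stem = {s: 1.0 / len(stems) for s in stems}
    p_suf = {x: 1.0 / len(suffixes) for x in suffixes}

    for _ in range(iterations):
        stem_counts = defaultdict(float)
        suf_counts = defaultdict(float)
        # E-step: posterior over split points for each word.
        for w in words:
            cands = splits(w)
            scores = [p_stem[s] * p_suf[x] for s, x in cands]
            total = sum(scores)
            for (s, x), score in zip(cands, scores):
                stem_counts[s] += score / total
                suf_counts[x] += score / total
        # M-step: because the model factorizes, the stem and suffix
        # distributions are re-normalised independently of each other.
        z_stem = sum(stem_counts.values())
        z_suf = sum(suf_counts.values())
        p_stem = {s: stem_counts[s] / z_stem for s in stems}
        p_suf = {x: suf_counts[x] / z_suf for x in suffixes}
    return p_stem, p_suf


def best_split(word, p_stem, p_suf):
    """Most probable (stem, suffix) split of a word under the learned model."""
    return max(
        splits(word),
        key=lambda sx: p_stem.get(sx[0], 0.0) * p_suf.get(sx[1], 0.0),
    )


# Hypothetical toy vocabulary of Russian noun forms, for illustration only.
toy_words = ["книга", "книги", "книгу", "лампа", "лампы", "лампу"]
p_stem, p_suf = em_stem(toy_words)
for w in toy_words:
    print(w, best_split(w, p_stem, p_suf))
```

Each M-step update touches only its own expected counts, which is the sense in which the components can be solved independently in this sketch. A model this simple has no prior on split positions, so the splits it prefers on real data may not match linguistic stems.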
Similar resources
Word Semantic Similarity for Morphologically Rich Languages
In this work, we investigate the role of morphology in the performance of semantic similarity for morphologically rich languages, such as German and Greek. The challenge in processing languages with richer morphology than English lies in reducing estimation error while addressing the semantic distortion introduced by a stemmer or a lemmatiser. For this purpose, we propose a methodology for sel...
Improving Translation to Morphologically Rich Languages (Améliorer la traduction des langages morphologiquement riches) [in French]
While statistical techniques for machine translation have made significant progress in the last 20 years, results for translating to morphologically rich languages are still mixed versus previous generation rule-based systems. Current research in statistical techniques for translating to morphologically rich languages varies greatly ...
Unsupervised Formation Matching in Highly Inflected Languages
There have been multiple attempts to resolve various inflection matching problems in information retrieval. Stemming is a common approach to this end. Among many techniques for stemming, statistical stemming has been shown to be effective in a number of languages, particularly highly inflected languages. In this paper we propose a method for finding affixes in different positions of a word. Com...
Template based affix stemmer for a morphologically rich language
Word stemming is one of the most significant factors affecting the performance of Natural Language Processing (NLP) applications such as Information Retrieval (IR) systems, part-of-speech tagging, machine translation systems, and syntactic parsing. The Urdu language raises several challenges to NLP, largely due to its rich morphology. In the Urdu language, the stemming process is different as compared to th...
Language-Specific Sentiment Analysis in Morphologically Rich Languages
In this paper, we propose language-specific methods of sentiment analysis in morphologically rich languages. In contrast to previous work confined to statistical methods, we make effective use of various linguistic features. In particular, we build chunk structures using the dependence relations of morpheme sequences to restrain the semantic scope of influence of opinionated terms. In conclusio...